Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_4

1. Data Exploration

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Given that the Index column had no impact on the target variable, it was dropped as part of the initial cleaning function. Additionally, the fields “INCOME”, “HOME_VAL”, “OLDCLAIM”, and “BLUEBOOK” were imported as character strings with leading “$” signs and were converted to numeric in the same cleaning function. Both the training and evaluation datasets pass through this treatment.
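The cleaning step described above can be sketched as follows. The report's actual cleaning function is written in R and is not reproduced here; this is a minimal Python illustration. The helper names (`parse_currency`, `clean_row`), the column name `INDEX`, and the presence of thousands separators (e.g. "$67,349") are assumptions.

```python
from typing import Optional

# Fields the report says arrive as "$"-prefixed strings.
CURRENCY_FIELDS = ("INCOME", "HOME_VAL", "OLDCLAIM", "BLUEBOOK")


def parse_currency(raw: Optional[str]) -> Optional[float]:
    """Turn a string such as "$67,349" into 67349.0; blanks stay missing (None)."""
    if raw is None:
        return None
    # Assumption: values may carry a "$" sign and "," thousands separators.
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if cleaned == "":
        return None
    return float(cleaned)


def clean_row(row: dict) -> dict:
    """Drop the index column and convert the currency fields to numbers."""
    # Assumption: the index column is named "INDEX".
    cleaned = {key: value for key, value in row.items() if key != "INDEX"}
    for field in CURRENCY_FIELDS:
        if field in cleaned:
            cleaned[field] = parse_currency(cleaned[field])
    return cleaned
```

Applying the same function to the training and evaluation rows keeps the two datasets consistent, which is the point of routing both through one cleaning step.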

Summary Stats

We compiled summary statistics on our data set to better understand the data before modeling.

##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                     NA's   :6      
##     HOMEKIDS           YOJ           INCOME         PARENT1         
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   Length:8161       
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   Class :character  
##  Median :0.0000   Median :11.0   Median : 54028   Mode  :character  
##  Mean   :0.7212   Mean   :10.5   Mean   : 61898                     
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986                     
##  Max.   :5.0000   Max.   :23.0   Max.   :367030                     
##                   NA's   :454    NA's   :445                        
##     HOME_VAL        MSTATUS              SEX             EDUCATION        
##  Min.   :     0   Length:8161        Length:8161        Length:8161       
##  1st Qu.:     0   Class :character   Class :character   Class :character  
##  Median :161160   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :154867                                                           
##  3rd Qu.:238724                                                           
##  Max.   :885282                                                           
##  NA's   :464                                                              
##      JOB               TRAVTIME        CAR_USE             BLUEBOOK    
##  Length:8161        Min.   :  5.00   Length:8161        Min.   : 1500  
##  Class :character   1st Qu.: 22.00   Class :character   1st Qu.: 9280  
##  Mode  :character   Median : 33.00   Mode  :character   Median :14440  
##                     Mean   : 33.49                      Mean   :15710  
##                     3rd Qu.: 44.00                      3rd Qu.:20850  
##                     Max.   :142.00                      Max.   :69740  
##                                                                        
##       TIF           CAR_TYPE           RED_CAR             OLDCLAIM    
##  Min.   : 1.000   Length:8161        Length:8161        Min.   :    0  
##  1st Qu.: 1.000   Class :character   Class :character   1st Qu.:    0  
##  Median : 4.000   Mode  :character   Mode  :character   Median :    0  
##  Mean   : 5.351                                         Mean   : 4037  
##  3rd Qu.: 7.000                                         3rd Qu.: 4636  
##  Max.   :25.000                                         Max.   :57037  
##                                                                        
##     CLM_FREQ        REVOKED             MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Length:8161        Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Class :character   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   Mode  :character   Median : 1.000   Median : 8.000  
##  Mean   :0.7986                      Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000                      3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                      Max.   :13.000   Max.   :28.000  
##                                                       NA's   :510     
##   URBANICITY       
##  Length:8161       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Check Class Bias

Next, we checked how balanced the target classes are. TARGET_FLAG has two values, 0 and 1. When building models, we ideally want an equal representation of both classes. As the class imbalance grows, model performance suffers both from differential variance between the classes and from bias towards picking the more represented class. For logistic regression, if we see a strong imbalance, we can 1) up-sample the smaller group, 2) down-sample the larger group, or 3) move the threshold for assigning the predicted value away from 0.5.

Classification of Target Flag
TARGET_FLAG   Proportion
0             0.7361843
1             0.2638157

The classes are imbalanced, with approximately 73.6% 0’s and 26.4% 1’s. With unbalanced class distributions, it is often necessary to artificially balance the classes to achieve good results. Up-sampling or down-sampling may be required to achieve class balance with this dataset. We will evaluate model performance with this imbalance in mind.
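The remedies for class imbalance can be illustrated with a short sketch. This is not the report's code (the analysis itself is in R); it is a minimal Python example. The function names, the fixed seed, and the example threshold are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def class_proportions(labels):
    """Share of each class, e.g. {0: 0.74, 1: 0.26}."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def upsample_minority(rows, label_key, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall (possibly zero rows) from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def predict_flag(probability, threshold=0.5):
    """Alternative to resampling: move the decision threshold away from 0.5."""
    return 1 if probability >= threshold else 0
```

Up-sampling should be applied to the training split only; resampling the evaluation data would distort the reported performance.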

Distributions

Next, we visualize the distribution profiles for each of the predictor variables. This helps us decide which variables to include, see how they might be related to each other or to the target, and identify outliers or transformations that might improve model resolution.

The distribution profiles show prevalent skewness, specifically right skew in the variables TRAVTIME, OLDCLAIM, MVR_PTS, TARGET_AMT, INCOME, and BLUEBOOK, and approximately normal distributions in YOJ, CAR_AGE, HOME_VAL, and AGE. When distributions deviate markedly from normal, this can be problematic for regression assumptions, and thus we might need to transform the data. For logistic regression, we will also need to dummy-encode the factor-based variables so that the model can use them.
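Skew can also be checked numerically rather than read off the histograms. The sketch below computes the moment coefficient of skewness in plain Python; it is an illustration only, not part of the report's R code, and the function name is an assumption. Positive values indicate a right tail, values near zero indicate symmetry.

```python
def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5
```

Missing values would need to be dropped before calling the function, since the arithmetic assumes every entry is a number.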

While we don’t tackle feature engineering in this analysis, a more in-depth analysis could leverage the mixtools package (see R Vignette). This package helps fit mixture models, including regressions where the data can be subdivided into subgroups.

Lastly, several features combine a continuous distribution with a large number of values at one extreme (for example, the zeros in OLDCLAIM and HOME_VAL). However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will not remove the extreme values, as they represent valuable data and could be predictive of the target.

Boxplots

In addition to the histograms, we used boxplots to get an idea of the spread of the response variable TARGET_AMT in relation to each of the non-numeric variables. Two sets of boxplots are shown below due to the wide distribution of the response variable. The first set covers the entire range and shows how high the cost of car crashes reaches within each category.

The second set of boxplots shows the same distributions “zoomed in”, with the axis adjusted so that the interquartile range of the response variable is visible for each of the categorical predictors.

Variable Plots

We plotted each variable against the target variable to get an idea of the relationship between them. Some notable trends appear in the scatterplots below: the response variable TARGET_AMT tends to be lower when individuals have more kids at home, as indicated by the HOMEKIDS feature, and when more teenagers drive the car, as indicated by the KIDSDRIV feature.

Additionally, a pairwise comparison plot of all features, both numeric and non-numeric, follows the scatterplots. It initially suggests that few features are strongly correlated, which gives some insight into the expected significance of the predictors and into whether dimensionality reduction is worthwhile for the models.
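For the numeric features, the pairwise relationships can be quantified with the Pearson correlation coefficient. The sketch below is a plain-Python illustration, not the report's R code; the function name is an assumption, and it does not cover the non-numeric features that the pairwise plot also includes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)
```

Values close to 1 or -1 between two predictors would flag candidates for removal or combination; values near 0 support keeping both.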

Data Sparsity Check

Finally, we can observe the sparsity of information within our dataset by using the DataExplorer package to assess missing information.

We can see that our dataset is generally in good shape; however, imputation is needed for INCOME, YOJ, HOME_VAL, and CAR_AGE, each of which is missing roughly 5 to 6% of its values, and for the 6 missing AGE values.
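The extent of the missing data follows directly from the NA counts in the summary statistics above (8,161 rows in total). The short Python sketch below only redoes that arithmetic; the variable and function names are assumptions, and the report itself uses the DataExplorer package in R for this check.

```python
# NA counts and row count taken from the summary statistics in this report.
N_ROWS = 8161
MISSING_COUNTS = {"AGE": 6, "YOJ": 454, "INCOME": 445, "HOME_VAL": 464, "CAR_AGE": 510}


def missing_share(counts, n_rows):
    """Fraction of rows missing for each field."""
    return {name: count / n_rows for name, count in counts.items()}


shares = missing_share(MISSING_COUNTS, N_ROWS)

# Print fields from most to least affected.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<9}{share:6.2%}")
```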

2. Data Preparation

To summarize our data exploration and preparation, we group our findings into the categories below:

Removed Fields

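Apart from the Index column dropped during initial cleaning, no fields were removed. None of the predictor variables is missing enough values, or shows enough indication of incorrect data, to justify dropping it. As such, we have kept all the remaining fields.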

Missing Values

Missing values will be imputed with the step_impute functions in the tidymodels recipes package.
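The report does not say which imputation step is used, so the sketch below assumes median imputation purely for illustration. It is plain Python rather than the report's R recipe, and the function names are hypothetical. The key design point it shows is that the fill value is learned from the training data and then reused on the evaluation data.

```python
import statistics


def fit_median(train_values):
    """Learn the fill value from the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    return statistics.median(observed)


def apply_imputation(values, fill_value):
    """Replace missing entries (None) with the previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Assumption: missing values are represented as None after loading.
train_car_age = [1, 8, None, 12, 8]
eval_car_age = [None, 3]

fill = fit_median(train_car_age)
train_imputed = apply_imputation(train_car_age, fill)
eval_imputed = apply_imputation(eval_car_age, fill)
```

Learning the fill value on the training split alone avoids leaking information from the evaluation set into the model.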

Outliers

No outliers were removed as all values seemed reasonable.

Transform non-normal variables

Finally, as noted in our data exploration and confirmed by the histogram plots, some of our variables are highly skewed. To address this, we applied transformations to make them more normally distributed. The plots below show the distributions before and after the transformations:
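The report does not name the transformations applied. As one common option for right-skewed, non-negative fields that contain zeros, the Python sketch below assumes a log(1 + x) transform and measures skewness before and after. The function names and the sample values are made up for illustration and are not taken from the dataset.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log1p_transform(values):
    """log(1 + x); defined at zero, unlike a plain log."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed sample with zeros and one large value.
sample = [0, 0, 0, 1000, 2000, 5000, 40000]
before = skewness(sample)
after = skewness(log1p_transform(sample))
```

Comparing `before` and `after` gives a numeric counterpart to the before-and-after histograms; a transform is worth keeping only if it moves the skewness clearly toward zero.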

3. Build Models

Using the training data, build at least two different multiple linear regression models and three different binary logistic models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.

Linear Regression

Linear Regression Model 1

First model: no transformations, all nominal predictors included.